Research on data imbalance in intrusion detection using CGAN

To address the problems of attack category omission and poor generalization ability of traditional Intrusion Detection System (IDS) when processing unbalanced input data, an intrusion detection strategy based on conditional Generative Adversarial Networks (cGAN) is proposed. The cGAN generates attack samples that approximately obey the distribution pattern of input data and are randomly distributed within a certain bounded interval, which can avoid the redundancy caused by mechanical data widening. The experimental results show that the strategy has better performance indexes and stronger generalization ability in overall performance, which can solve insufficient classification performance and detection omission caused by unbalanced distribution of data categories and quantities.


Introduction
As social behaviors become more intelligent, the network security boundary is increasingly blurred, and the attack methods and tools for network intrusion become more and more diverse. The effective protection and safe circulation of data have become key issues for the development of digital society. Among them, data imbalance processing is an important part of network intrusion detection that cannot be ignored. Real network attacks include many categories of attack behavior, but they seldom happen. The proportion of various types of attack data in the total traffic data is less than 0.1%. In general, the number of attacks is small. Consequently, many intrusion detection methods cannot learn the complete data features during identification, or even omit them.
In recent years, the outstanding feature-extraction capability of deep learning has attracted the research interest of many scholars in the field of intrusion detection. Zhang Y et al. [1] proposed a network intrusion detection method based on an autoencoder and a long short-term memory neural network to address the high data dimensionality and complex feature extraction process of traditional network intrusion detection methods. Zavrak S et al. [2], on the other hand, focused on detecting anomalous network traffic from flow-based network data using unsupervised and semi-supervised deep learning methods. Staudemeyer R C et al. [3] modeled network traffic as time series with supervised learning methods, using known normal and malicious behavior data for training and improving intrusion detection. Dai Yuanfei et al. [4] investigated the degradation of detection model accuracy and the long training time caused by redundancy or noise in existing intrusion detection algorithms, introduced feature selection algorithms into the field of intrusion detection, and proposed a feature selection-based intrusion detection method. Yin C L et al. [5] explored how to model intrusion detection systems based on deep learning, and proposed a deep learning method using a recurrent neural network for intrusion detection. However, in the above studies, when classes and sample counts were imbalanced, training was not informative and the results were biased towards the majority class, which led to a lower attack recognition effect than expected and a larger bias in the detection results.
In imbalanced scenarios, not only do the class and quantity imbalance ratios change over time, but the relationships between classes also change. Moreover, the cost of misclassifying abnormal behavior data as normal is usually higher than the cost of the reverse error. To address such problems, Chawla N V et al. [6] proposed a method for constructing classifiers from unbalanced datasets, the Synthetic Minority Over-sampling Technique (SMOTE). Combining oversampling of the minority class with undersampling of the majority class yielded a classifier with better performance in receiver operating characteristic (ROC) space than undersampling only the majority class. To further improve the metrics, Hui H et al. [7] proposed two new minority class oversampling methods, Borderline-SMOTE1 and Borderline-SMOTE2, based on the SMOTE method. In these methods, only the minority class data near the boundary are oversampled. With this design, better recall and F-values were obtained. He H et al. [8] proposed a novel adaptive synthetic sampling approach for learning from unbalanced datasets, the Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN). Yang Y et al. [9] proposed a novel network intrusion detection model with a regularized supervised adversarial variational autoencoder. However, some critical issues still remain.
1. The classes of current network attack behavior are imbalanced. Existing methods are not effective enough in dealing with this problem, resulting in missing classes, missing key information, and missed attacks. The detection rate is low.
2. The incremental data obtained by existing oversampling methods possess few real features, rigid distribution patterns, and excessive information redundancy, and omit data distribution details, so they are far from real network attack behavior data.
3. The features of the minority classes overlap, so classification tends to produce data annexation. The recognition results of existing classification methods are seriously biased towards the majority class, and new attack behavior data cannot be identified when they appear, so it is difficult to distinguish distribution patterns that fluctuate within a certain range.
To solve the above problems, we propose a network intrusion detection strategy based on a conditional generative adversarial network, which takes real data as the ideal learning object, randomly simulates the feature distribution of minority class data in a certain bounded interval with the generative adversarial network, and approximates the real data distribution through adversarial training to obtain a large amount of selectable training data and achieve minority class data enhancement. In this way, this paper addresses the insufficient classification performance and detection omission caused by the unbalanced distribution of data categories and quantities in the classification problem.

Generative adversarial networks
A Generative Adversarial Network (GAN) [10] is a deep generative model that consists of two competing neural network models: the Generator (G) and the Discriminator (D). The GAN network architecture is shown in Fig 1.
The continuous adversarial training of model G and model D maximizes the probability that D correctly discriminates the source of training samples, while maximizing the similarity between the data generated by G and the real data. The training of D and G can be expressed as a two-player minimax game over a value function, that is, the loss function of the GAN, which is shown in Eq (1).
Goodfellow et al. [10] proved that this two-player minimax game has a global optimum if and only if p_g = p_data, i.e., when a Nash equilibrium is reached. At this point, the generative model G has learned the distribution of the real samples, so the output of the discriminative model D stabilizes at 1/2, i.e., D can only make random guesses between 0 and 1 for the training samples. The parameters are updated by back-propagation of the error gradient.
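The value function referred to as Eq (1) is not reproduced in this text. For reference, the standard GAN objective of Goodfellow et al. [10] can be written as below, where p_z denotes the prior of the random noise z (a symbol assumed here, since the text names only p_g and p_data). The second line gives the optimal discriminator for a fixed G, which equals 1/2 when p_g = p_data.

```latex
\min_{G}\max_{D} V(D,G) =
  \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]

D^{*}_{G}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{g}(x)}
```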

Conditional generative adversarial networks
The cGAN [11] is a variant of the generative adversarial network. The input to the generator of the original generative adversarial network is a random noise signal, and the input to the discriminator is real data and generated data. The input to the generator of the conditional generative adversarial network is a combination of conditional information and random noise, and the input to the discriminator is the data obtained by concatenating the conditional information with the real data and with the generated data, respectively. The training process is similar to that of the original generative adversarial network, so it is not described here. The loss function of the conditional generative adversarial network is shown in Eq (2).
With the development of the network, detection techniques that fuse anomaly detection and misuse detection models have gradually formed, as shown in Fig 3. The captured network data is preprocessed, and data in the application layer are matched against the whitelist database. If the matching is successful, the application layer data and the instruction data are put into the state machine for execution, and then state analyses are performed. If not, the system collects status statistics directly. Finally, analyses and corresponding processing are carried out. At the same time, abnormal patterns that cannot be associated are added to the abnormal data set and fed back through data mining to update the blacklist database and whitelist database.
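The cGAN loss function referred to as Eq (2) is likewise not reproduced in this text. In the standard formulation of [11], both networks are conditioned on the same side information. The sketch below writes that condition as c and the noise prior as p_z; both symbols are notational assumptions.

```latex
\min_{G}\max_{D} V(D,G) =
  \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x \mid c)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z \mid c) \mid c)\big)\big]
```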

Analyses of imbalanced data distribution
NSL-KDD [13] is a dataset proposed to solve the inherent problems in the KDD99 dataset. Many studies use it as an effective benchmark dataset to help researchers compare different intrusion detection methods. Therefore, the evaluation results of different research efforts based on the NSL-KDD dataset are consistent and comparable.
The NSL-KDD dataset includes five types of data: normal traffic data and four major categories of attack behavior data. Each record has 43 columns, of which columns 1 to 41 are the characteristics of the network data flow itself. Column 42 is the category label, and column 43 (the difficulty level in NSL-KDD) is irrelevant to this experiment and is removed during data processing.
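The column layout described above can be sketched as follows. This is a minimal illustration that assumes comma-separated records and the common NSL-KDD layout in which the 43rd column is a difficulty score. The function name `split_record` and the placeholder record are hypothetical, not part of the paper's code.

```python
def split_record(line):
    """Split one comma-separated NSL-KDD record into (features, label)."""
    fields = line.strip().split(",")
    if len(fields) != 43:
        raise ValueError(f"expected 43 columns, got {len(fields)}")
    features = fields[:41]  # columns 1-41: characteristics of the data flow
    label = fields[41]      # column 42: category label
    # column 43 (fields[42]) is assumed irrelevant here and is dropped
    return features, label


# Placeholder record (made-up values, only the shape matters).
sample = ",".join(["0"] * 41 + ["normal", "20"])
features, label = split_record(sample)
print(len(features), label)
```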
The analyses of the attack samples are shown in Fig 4. Combined with the proportion analysis in Fig 5, the distribution of categories is extremely unbalanced, with 12 categories together accounting for only about 0.1% of the data: guess_passwd (0.042%), buffer_overflow (0.024%), warezmaster (0.016%), land (0.014%), imap (0.009%), rootkit (0.008%), loadmodule (0.007%), ftp_write (0.006%), multihop (0.006%), phf (0.003%), perl (0.002%) and spy (0.002%). The percentage of each type of data varies greatly. The unbalanced classes account for 52.17% of the total number of classes, but their data account for only 0.139% of the total amount of data, which tends to mislead the classifier and cause detection omission and inaccurate identification. In addition, since the percentage is small, even if the abnormal behavior is recognized, the recognition rate is too low to attract attention.
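The two aggregate figures can be checked directly from the per-class percentages listed above. The short sketch below sums the listed shares and computes the fraction of classes that are minority classes, taking the total of 23 classes from the multi-classification experiment. The variable names are illustrative.

```python
# Per-class shares (in percent) as listed in the text.
minority_share = {
    "guess_passwd": 0.042, "buffer_overflow": 0.024, "warezmaster": 0.016,
    "land": 0.014, "imap": 0.009, "rootkit": 0.008, "loadmodule": 0.007,
    "ftp_write": 0.006, "multihop": 0.006, "phf": 0.003, "perl": 0.002,
    "spy": 0.002,
}
TOTAL_CLASSES = 23  # number of classes in the multi-classification task

# Combined share of the minority classes in the whole dataset (percent).
total_share = round(sum(minority_share.values()), 3)
# Fraction of all classes that are minority classes (percent).
class_fraction = round(100 * len(minority_share) / TOTAL_CLASSES, 2)

print(total_share, class_fraction)
```

The listed shares sum to 0.139%, and 12 of 23 classes corresponds to 52.17%.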
To further verify the effect of unevenly distributed data on the classifier, a decision tree model is used to classify the records of the KDDTrain+.txt file. The classification results are shown in Table 1 with key information highlighted. It can be seen that the class distribution is extremely uneven, so that for some categories the precision, recall, F-value and support are unsatisfactory or even 0, which has no reference value. The detection results do not show any phf attack behavior, which is a missed detection and reduces the reference value and effectiveness of the classification accuracy. In addition, in the multi-classification results for these 23 types of data, the macro avg of precision is 0.72, the macro avg of recall is 0.75, and the macro avg of F-value is 0.72, which is very unsatisfactory. In real network intrusion attacks, the distribution of attack behavior relative to normal behavior is similar to this situation. Therefore, great attention must be paid to the problem of unbalanced data distribution, which is also the focus of this paper.
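To illustrate why a missed minority class pulls the macro averages down even when overall accuracy looks high, the following sketch computes per-class precision, recall, F-value and support, and then their unweighted (macro) means. The toy labels are made up for illustration and are not NSL-KDD results. The function names are hypothetical.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: {precision, recall, f, support}} for each class."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall,
                         "f": f, "support": tp + fn}
    return report


def macro_average(report, key):
    """Unweighted mean of one metric over all classes."""
    return sum(m[key] for m in report.values()) / len(report)


# Toy data: the single phf record is predicted as normal (a missed detection).
y_true = ["normal"] * 8 + ["buffer_overflow"] * 3 + ["phf"]
y_pred = ["normal"] * 8 + ["buffer_overflow"] * 3 + ["normal"]
report = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_average(report, "recall"), report["phf"])
```

In this toy case 11 of 12 records are classified correctly, yet the phf class scores 0 on every metric and the macro recall falls to 2/3.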

Solution strategy of imbalanced data of intrusion detection based on cGAN
Based on the above analyses of the imbalanced data and of generative adversarial networks, we propose a cGAN-based solution to the imbalance of intrusion detection data, as shown in Fig 6, to address the missed data, incomplete category coverage and low recognition rate of scarce data caused by the imbalance of attack types and quantities in network intrusion detection.
This strategy mainly includes four parts, namely the generator module G, the discriminator module D, the target network F and the identification network F_Classification.
The condition vector R_label is flattened using a Flatten layer and an Embedding layer. A Multiply layer is used to combine the random noise R_seed with R_label, converting them to the input shape accepted by generator G. Generator G is responsible for learning the distribution rules of attack behavior and normal behavior data, extracting the probability distribution characteristics, and then generating expanded attack samples that approximate the distribution of attack behavior data. At the same time, G receives feedback from the loss function and outputs G_data and its corresponding label information G_label, which fit the latent spatial distribution pattern of the input data. The generator tries to "trick" the discriminator by generating samples so that it cannot correctly distinguish between real data samples and generated data samples.
The target network F is composed of intrusion data. This paper uses the NSL-KDD dataset. In the F network, attacks are unevenly distributed. After numericalization, data standardization, and normalization, F_data and F_label are generated to match the input shape of the discriminator. F_data and F_label make up only a small share of the total network traffic. Their dimension increases with the degree of use of the network, and their variety increases with the diversification of attack methods and attack tools. F_data and F_label tend to be ignored by ordinary network intrusion detection systems and are the blind spots of the defense identification system.
The discriminator D receives the generated samples G_data, the generated sample label information G_label, the real attack behavior data F_data, and the real attack behavior labels F_label, and makes its judgment in the direction that minimizes the loss. The discriminator D aims to discriminate the source of the input information, i.e., to effectively distinguish G_data from F_data. The discriminator result feeds optimization information back through G_label and F_label, which in turn improves the generation effect of the generator, fine-tunes the generation direction, and optimizes the fitting effect. The discriminator uses the computational advantages of deep learning in data processing to continuously reduce its loss, improve the comprehensive performance of the model, and improve the generalization ability. The generator and the discriminator improve their generation effect and discriminative ability, respectively, through changes in the loss, and finally reach the Nash equilibrium state.
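The way the label condition and the random noise are merged before entering the generator can be sketched as an embedding lookup followed by an element-wise product. The snippet below is a framework-free illustration under stated assumptions: the embedding table is filled with fixed random values (in a trained model it would be learned), and the names `make_embedding_table` and `generator_input` are hypothetical.

```python
import random


def make_embedding_table(num_classes, noise_dim, seed=0):
    """One dense vector of length noise_dim per class (random stand-in)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
            for _ in range(num_classes)]


def generator_input(r_seed, r_label, table):
    """Combine noise R_seed with the embedded label R_label."""
    embedded = table[r_label]  # Embedding: class index -> dense vector
    if len(embedded) != len(r_seed):
        raise ValueError("noise and embedding must have the same length")
    # Multiply: element-wise product of noise and label embedding
    return [n * e for n, e in zip(r_seed, embedded)]


table = make_embedding_table(num_classes=3, noise_dim=4)
rng = random.Random(1)
r_seed = [rng.gauss(0.0, 1.0) for _ in range(4)]
g_in = generator_input(r_seed, r_label=2, table=table)
print(len(g_in))
```

Multiplying rather than concatenating keeps the generator input at the noise dimension while still making it depend on the class.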

The selection of attack samples
The flowchart of the attack algorithm is shown in Fig 7. The F network consists of data with the characteristics of attack behavior against the network, which form the attack samples. Attack behavior records of the target network are selected one at a time from the attack set until the number of attack samples equals num. Here, data is the network traffic data, including normal behavior and attack behavior, and label is the category information of the normal and attack behavior. F_data is the data generated by the model that match the distribution characteristics of the attack behavior in the real traffic of the network, F_label is the corresponding attack behavior category information, and num is a fixed value, chosen from experience, that satisfies the classification index of the network attack behavior.
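Since Fig 7 is not reproduced here, the selection loop can only be sketched under assumptions: attack records are drawn uniformly at random, with replacement, from the records whose label is not the normal class, until num samples have been collected. The function name and the `normal_name` parameter are hypothetical.

```python
import random


def select_attack_samples(data, label, num, normal_name="normal", seed=None):
    """Pick attack records one at a time until num samples are collected."""
    rng = random.Random(seed)
    attack_idx = [i for i, name in enumerate(label) if name != normal_name]
    if not attack_idx:
        raise ValueError("no attack records available")
    f_data, f_label = [], []
    while len(f_data) < num:
        i = rng.choice(attack_idx)  # assumption: uniform, with replacement
        f_data.append(data[i])
        f_label.append(label[i])
    return f_data, f_label


# Toy traffic: two normal records and two attack records.
data = [[0.1], [0.9], [0.2], [0.8]]
label = ["normal", "phf", "normal", "spy"]
f_data, f_label = select_attack_samples(data, label, num=5, seed=1)
print(len(f_data), sorted(set(f_label)))
```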

Experimental environment and parameter settings
The computer system is Windows 10. The processor is an Intel(R) Core(TM) i9-10920X CPU @ 3.50GHz with 128GB of RAM and dual NVIDIA GeForce RTX 3090 GPUs. The computer is equipped with Python 3.7 and the TensorFlow 2.2 framework.
In the training process, the generator and discriminator of the cGAN both adopt an architecture that combines convolutional layers with Batch Normalization units. In the generator and discriminator, a "small and deep" network structure is used. The generator and discriminator are trained alternately using a batch training method, with the number of training rounds set to EPOCHS and the batch size set to BATCH_SIZE.
When training the generator, the parameters of the generator network are set empirically, and the random noise z of size noise_dim and the conditional vector c are obtained, where noise_dim is the manually set dimension of the random noise. The random noise z and the conditional vector c are concatenated and input into the generator to generate samples with the same dimension as the attack behavior features, which are passed into the discriminator and classifier. The Adam optimizer is used to optimize the loss function of the generator, back-propagate, and update the parameters of the generator. This step is repeated several times until the parameters cannot be further optimized.
When training the discriminator, the discriminator network parameters are set empirically, and data of size noise_dim are drawn from the training samples and input to the discriminator to optimize the corresponding loss function. Noise z of size noise_dim and the conditional vector c are obtained and input to the generator to produce generated samples with the same dimensionality as the attack behavior features. The loss function is back-propagated and the parameters of the discriminator are updated. The procedure is repeated until the parameters cannot be further optimized.
The alternating training of the generator and discriminator is repeated until network training is completed, so that valid attack behavior data can be "faked". Each complete pass over the training samples completes one round of training, after which the relevant parameters of the generator and discriminator are saved.
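The alternating schedule described above can be sketched as a plain training loop. This is only a skeleton: `g_step` and `d_step` are hypothetical callables standing in for the optimizer updates of the generator and discriminator (for example Adam updates in TensorFlow), and the per-round saving of parameters is omitted.

```python
import random


def sample_generator_inputs(batch_size, noise_dim, num_classes, rng):
    """Draw random noise z and integer condition labels c for one batch."""
    z = [[rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
         for _ in range(batch_size)]
    c = [rng.randrange(num_classes) for _ in range(batch_size)]
    return z, c


def train_cgan(batches, g_step, d_step, epochs, noise_dim, num_classes, seed=0):
    """Alternate one generator update and one discriminator update per batch."""
    rng = random.Random(seed)
    history = []
    for epoch in range(epochs):
        for real_x, real_c in batches:
            batch_size = len(real_x)
            # Generator update: only G's parameters would change here.
            z, c = sample_generator_inputs(batch_size, noise_dim, num_classes, rng)
            g_loss = g_step(z, c)
            # Discriminator update: real batch plus a freshly generated batch.
            z, c = sample_generator_inputs(batch_size, noise_dim, num_classes, rng)
            d_loss = d_step(real_x, real_c, z, c)
            history.append((epoch, g_loss, d_loss))
    return history


# Stub steps that only record the call order (no real networks involved).
calls = []
toy_batches = [([[0.0]] * 4, [0] * 4)] * 3
history = train_cgan(
    toy_batches,
    g_step=lambda z, c: calls.append("g") or 0.0,
    d_step=lambda x, xc, z, c: calls.append("d") or 0.0,
    epochs=2, noise_dim=5, num_classes=3,
)
print(len(history), calls[:4])
```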

Experimental results and analysis of data of minority class
As can be seen from Fig 8, after about 500 epochs the model reaches game stability and achieves a dynamic balance between the generator and the discriminator, at which point the generative adversarial network has reached its best training effect. In principle, at this time, the samples produced by the generator can no longer be quickly and accurately identified by the discriminator. The discriminator has sufficient discriminatory ability, which means that the distribution of the generated samples at this time is close to that of the original data samples. The generator part of the model is saved separately. It can be called at any time to generate data for the imbalanced classes and make up for the unbalanced data, thus ensuring the effectiveness of the samples and the detection efficiency.
In order to carry out unbalanced sample augmentation, the number of generated samples is set at 900 groups. Another 900 groups are randomly selected from the original dataset, and a new dataset is formed with a ratio of attack behavior data to normal behavior data of approximately 1:1 for binary classification validation with the decision tree. Taking the buffer_overflow data as an example, the validation results are shown in Table 2. The accuracy can reach 99.46% at minimum. The precision rate is 0.99. The recall rate is 1.00. The F-value is 0.99. The support is 463. The indicators show that the generated samples have the distribution characteristics of real attack behaviors, which is sufficient to replace the real samples for model training.
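The construction of the 1:1 validation set can be sketched as follows. The snippet assumes that the 900 records drawn from the original dataset are normal-behavior records, which the text does not state explicitly. The function name and the integer stand-ins for records are hypothetical.

```python
import random


def build_balanced_set(generated_attacks, original_normals, n=900, seed=0):
    """Pair n generated attack samples with n randomly drawn normal samples."""
    if len(generated_attacks) < n or len(original_normals) < n:
        raise ValueError("not enough samples to build a 1:1 set")
    rng = random.Random(seed)
    attacks = generated_attacks[:n]
    normals = rng.sample(original_normals, n)  # random draw without replacement
    rows = [(x, 1) for x in attacks] + [(x, 0) for x in normals]  # 1 = attack
    rng.shuffle(rows)
    return rows


# Integer stand-ins for feature vectors.
generated = list(range(900))
normals = list(range(1000, 3000))
rows = build_balanced_set(generated, normals)
print(len(rows), sum(y for _, y in rows))
```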
The precision, recall, F-value and support of the decision tree classification before and after using the proposed model are compared in Table 3. The precision changed from 1.00 to 0.99, the recall from 0.60 to 1.00, the F-value from 0.75 to 0.99, and the support from 1 to 463. The slight decrease in precision is due to the increase in the number of minority class samples, which "dilutes" the probability of the original data masking the attack data. The changes in the other metrics are extremely obvious.
In this paper, the KFold (n_splits = 10) cross-validation method is used, and the accuracy of the generated samples is verified with the feature selection evaluation experiment. The results are shown in Fig 9.

Experimental results and analysis of 23 classification based on NSL-KDD dataset
First, the 12 attack types guess_passwd, buffer_overflow, warezmaster, land, imap, rootkit, loadmodule, ftp_write, multihop, phf, perl, and spy are each augmented using the cGAN-based network intrusion detection strategy. The data distribution after the augmentation is shown in Table 4. From the change in the augmentation ratio, we can see that the more serious the unbalanced distribution in the original dataset is, the larger the augmentation ratio is, which indicates that the proposed strategy is relevant to this kind of problem. Finally, we use the decision tree for 23-class classification to verify the effectiveness of the proposed strategy, and compare the results with those on the data without the strategy. The results are shown in Table 5.
Table 5 shows that the precision rate, recall rate, F-value and support in the identification results for unbalanced intrusion data after optimization with the cGAN-based intrusion detection strategy are much higher than those of direct classification without processing. The macro average precision rate rises from 0.72 to 0.93, an increase of 29.17%. The macro average recall rate rises from 0.75 to 0.93, an increase of 24%. The macro average F-value rises from 0.72 to 0.93, an increase of 29.17%, and the macro average support rises from 12598 to 13713, an increase of 8.85%. Provided that the data in the network dataset remain valid after the generated samples are added, the cGAN-based intrusion detection strategy significantly improves the classification of network attacks and can effectively process the unevenly distributed data to improve the defense capability of the system.
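The percentage increases quoted above are relative changes with respect to the value before augmentation. The short check below reproduces them from the reported macro averages. The helper name `relative_increase` is illustrative.

```python
def relative_increase(before, after):
    """Relative change from before to after, in percent, to two decimals."""
    return round(100 * (after - before) / before, 2)


precision_gain = relative_increase(0.72, 0.93)
recall_gain = relative_increase(0.75, 0.93)
f_value_gain = relative_increase(0.72, 0.93)
support_gain = relative_increase(12598, 13713)
print(precision_gain, recall_gain, f_value_gain, support_gain)
```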
For the classification experiments, accuracy is the clearest indicator of the performance of the strategies and models. As shown in Table 6, the multi-classification accuracy of the proposed strategy reaches 98.65%, which is the highest among all the compared methods and at least 3.87% better than the other algorithms.
When the data categories and quantities are severely unbalanced, accuracy can no longer fully represent the performance of the model, so other metrics need to be introduced. Therefore, in this paper, the precision rate, recall rate and F-value are used as further analysis metrics. The specific comparison results are shown in the following tables and figure.
The precision rate represents the relative correctness of the prediction results, i.e., the proportion of samples predicted as attacks that really are attacks, and is used to measure whether there are cases of misclassification. As can be seen from Table 7, the precision rates of the proposed strategy for the classes Normal, Prob, Dos, U2R, and R2L are 1.000, 0.9775, 0.9817, 0.8325, and 0.9012, respectively. Among them, the precision rates for the classes Prob, Dos, U2R, and R2L are not the highest values, but the difference is small. Network intrusion is not a single type of attack, so the overall performance of the model matters more. Based on the precision rate values and the comprehensive data, the overall performance of the proposed strategy is much better than that of the other intrusion detection methods.
The recall rate indicates the proportion of a certain type of behavior in the dataset that is correctly predicted. Since the number of normal behaviors in the dataset is very high, a model can reach a high overall accuracy while still identifying attack behaviors poorly, which is not the desired result. Here the recall rate comes into play to measure the ability to predict attack behaviors and to check whether there are any omissions. As can be seen from Table 8, the recall rates of the proposed strategy for the Normal, Prob, Dos, U2R, and R2L classes are 1.0000, 0.9800, 0.9850, 0.8325, and 0.9150, respectively, with the Normal, Prob, Dos, and R2L classes being the highest scores. Although the RELM algorithm scores a little higher in the U2R category, it has lower recall scores in the Normal, Dos and R2L categories. Based on the recall values and the comprehensive data, the overall performance of the proposed strategy is much better than that of the other intrusion detection methods.
The F-value is a criterion obtained by combining the precision and recall rates. The higher the F-value, i.e., the higher the harmonic mean of precision and recall, the better the model. From Table 9, it can be seen that the F-values of our strategy are 1.0000, 0.9775, 0.9833, 0.8325, and 0.9063 for the Normal, Prob, Dos, U2R, and R2L categories, respectively. The F-values of the other methods for the R2L class are significantly lower than that of the proposed strategy, and, with the F-value as the criterion and considering the comprehensive data, the overall performance of the proposed strategy is much better than that of the other intrusion detection methods.
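As a worked example of the F-value, the sketch below computes the harmonic mean of precision and recall and applies it to the buffer_overflow figures reported before and after augmentation (precision 1.00 and recall 0.60, then precision 0.99 and recall 1.00). The function name `f_value` is illustrative.

```python
def f_value(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f_before = round(f_value(1.00, 0.60), 2)  # before augmentation
f_after = round(f_value(0.99, 1.00), 2)   # after augmentation
print(f_before, f_after)
```

The results, 0.75 and 0.99, match the F-values reported for this class. The arithmetic mean of 1.00 and 0.60 would be 0.80, which shows that the harmonic mean penalizes the low recall more strongly.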

Comparison experiments on UNSW-NB15 dataset
In order to demonstrate the generalization ability of the proposed method, it is also validated on the UNSW-NB15 dataset. The experimental setup and evaluation indexes are the same as those for the NSL-KDD dataset. The distribution of the number of attack categories in the UNSW-NB15 dataset is shown in Fig 11. Among them, the attack behavior data of the Analysis, Backdoor, Shellcode and Worms categories account for 1.14%, 1.00%, 0.65% and 0.07% of the total number, respectively, which are highly unbalanced in terms of quantitative distribution.
This part only aims to verify whether the method is valid on the UNSW-NB15 dataset and has sufficient generalization ability, so the other hyperparameters are kept consistent with the NSL-KDD dataset experiments. The number of training rounds, EPOCHS, is set to 2500.
To verify that the generated category data have sufficient characteristics to simulate the features of real category data, binary classification validity experiments are conducted. The results are shown in Table 10, where the normal and attack attributes of the data can be efficiently discriminated. The generated data have the distribution characteristics of each type of attack data, which can solve the data omission and annexation problems caused by the unbalanced distribution.
The comparison results are shown in Table 11 and Fig 12. In the Backdoor and Worms classes, our strategy achieves the highest detection rates of 0.82 and 0.92, and performs slightly lower than MultinomialNB and ADASYN in the DoS and Analysis classes. However, MultinomialNB and ADASYN perform worse in the remaining classes. Combining the final detection rates of the four classes of data, we can see that the presented strategy has the best detection performance.

Conclusions
In this paper, an intrusion detection method based on conditional generative adversarial networks is proposed. Taking advantage of the generative adversarial network model in data representation and distribution learning, the minority classes of attack behavior data that are difficult to identify are augmented so that the generated data are randomly distributed within a certain bounded interval while following the latent spatial distribution pattern of real attack data.
With the proposed strategy, the need to design the complex loss functions required by existing deep learning methods is avoided by adding pooling layers, reducing the size of convolutional kernels, deepening the network structure, and adopting a batch training approach. The experimental results show that the proposed strategy can efficiently address the unbalanced category and quantity shares of attack behaviors in real network data streams, enhance the robustness of the classifier when facing different types of attack input data, improve the effectiveness of the intrusion detection model when identifying data with an unbalanced distribution, and enhance the generalization ability of the defense model to attack behaviors from different data sources.

As can be seen from Fig 9, the results of the cross-validation feature selection evaluation test indicate a classification accuracy above 99.65%, indicating that the distribution of the samples generated by the adversarial network fits the distribution of real attack behavior better than expected. It can be concluded that the cGAN-based intrusion detection strategy can effectively address the problem of unbalanced data and improve the intrusion detection effect.

Fig 12. Comparison of the detection performance of different classification methods and oversampling methods with the strategy of this paper on the UNSW-NB15 dataset.
https://doi.org/10.1371/journal.pone.0291750.g012

Table 9. Comparison of F-values of multi-classification results for different intrusion detection models.
https://doi.org/10.1371/journal.pone.0291750.t009

Because the Analysis, Backdoor, Shellcode and Worms types are highly unbalanced in distribution compared to other types of attack data, the experiments and comparisons are validated with these four types.

Table 11. Comparison of the detection performance of different classification methods and oversampling methods with the strategy proposed on the UNSW-NB15 dataset.
https://doi.org/10.1371/journal.pone.0291750.t011